NSF PAR Search | NSF Public Access Repository


Architectures for Serial and Parallel Pipelined NTT-Based Polynomial Modular Multiplication

https://doi.org/10.1109/TVLSI.2025.3576782

Chiu, Sin-Wei; Parhi, Keshab K. (June 2025, IEEE Transactions on Very Large Scale Integration (VLSI) Systems)

Quantum computers pose a significant threat to modern cryptographic systems by efficiently solving problems such as integer factorization through Shor’s algorithm. Homomorphic encryption (HE) schemes based on ring learning with errors (Ring-LWE) offer a quantum-resistant framework for secure computations on encrypted data. Many of these schemes rely on polynomial multiplication, which can be efficiently accelerated using the number theoretic transform (NTT) in leveled HE, ensuring practical performance for privacy-preserving applications. This article presents a novel NTT-based serial pipelined multiplier that achieves full-hardware utilization through interleaved folding, and overcomes the 50% under-utilization limitation of the conventional serial R2MDC architecture. In addition, it explores tradeoffs in pipelined parallel designs, including serial, 2-parallel, and 4-parallel architectures. Our designs leverage increased parallelism, efficient folding techniques, and optimizations for a selected constant modulus to achieve superior throughput (TP) compared with state-of-the-art implementations. While the serial fold design minimizes area consumption, the 4-parallel design maximizes TP. Experimental results on the Virtex-7 platform demonstrate that our architectures achieve at least 2.22 times higher TP/area for a polynomial length of 1024 and 1.84 times for a polynomial length of 4096 in the serial fold design, while the 4-parallel design achieves at least 2.78 times and 2.79 times, respectively. The efficiency gain is even more pronounced in TP squared over area, where the serial fold and 4-parallel designs outperform prior works by at least 4.98 times and 26.43 times for a polynomial length of 1024 and 6.7 times and 43.77 times for a polynomial length of 4096, respectively. These results highlight the effectiveness of our architectures in balancing performance, area efficiency, and flexibility, making them well-suited for high-speed cryptographic applications.
Free, publicly-accessible full text available June 11, 2026
On Computing Linear, Positive-Wrapped (Circular), and Negative-Wrapped Convolutions in the Frequency Domain [Tips & Tricks]

https://doi.org/10.1109/MSP.2025.3544185

Chiu, Sin-Wei; Parhi, Keshab K. (May 2025, IEEE Signal Processing Magazine)

Convolution is a fundamental operation with diverse applications in signal processing, computer vision, and machine learning. This article reviews three distinct convolutions: linear convolution (also referred to as aperiodic convolution), positive-wrapped convolution (PWC) (also known as circular convolution), and negative-wrapped convolution (NWC). Additionally, we propose an alternative approach to computing linear convolution without zero padding by leveraging the PWC and NWC. We compare two fast Fourier transform (FFT)-based methods to compute linear convolution: the traditional zero-padded PWC method and a new method based on the PWC and NWC. Through a detailed analysis of the flowgraphs (FGs), we demonstrate the equivalence of these methods while highlighting their unique characteristics. We show that computing the NWC using the weighted PWC method is equivalent to a part of the linear convolution computation with zero padding. Furthermore, it is possible to extract the PWC and NWC from structures to compute linear convolution with zero padding, where the last butterfly stage can be eliminated. This article aims to establish a clear connection among PWC, NWC, and linear convolution, illustrating new perspectives on computing different convolutions.
Free, publicly-accessible full text available May 1, 2026
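The relationship described in the abstract above (linear convolution recovered from the PWC and NWC without zero padding) can be illustrated with a small sketch. This is not the article's FFT flowgraph: it assumes integer inputs of equal length N and computes both wrapped convolutions directly from their definitions, and the function names are hypothetical. With c denoting the linear convolution, PWC[k] = c[k] + c[k+N] and NWC[k] = c[k] - c[k+N], so the sum and difference give the low and high halves of c.

```python
def linear_conv(a, b):
    """Reference linear (aperiodic) convolution, length len(a) + len(b) - 1."""
    c = [0] * (len(a) + len(b) - 1)
    for i, x in enumerate(a):
        for j, y in enumerate(b):
            c[i + j] += x * y
    return c


def wrapped_conv(a, b, sign):
    """sign=+1: positive-wrapped (circular) convolution, i.e. mod x^N - 1.
    sign=-1: negative-wrapped convolution, i.e. mod x^N + 1."""
    n = len(a)
    c = [0] * n
    for i in range(n):
        for j in range(n):
            k = i + j
            if k < n:
                c[k] += a[i] * b[j]
            else:
                # Terms that overflow past index N-1 wrap around with the given sign.
                c[k - n] += sign * a[i] * b[j]
    return c


def linear_from_pwc_nwc(a, b):
    """Recover the linear convolution from the PWC and NWC (integer inputs assumed)."""
    n = len(a)
    p = wrapped_conv(a, b, +1)
    q = wrapped_conv(a, b, -1)
    low = [(x + y) // 2 for x, y in zip(p, q)]   # c[0 .. N-1]
    high = [(x - y) // 2 for x, y in zip(p, q)]  # c[N .. 2N-1], last entry is 0
    return (low + high)[: 2 * n - 1]
```

In a fast implementation each wrapped convolution would be computed with length-N transforms rather than the double loops used here; the loops only keep the sketch exact and dependency-free.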
Low-Complexity NTT and INTT Structures via Twiddle Shifting

https://doi.org/10.1109/MWSCAS53549.2025.11244551

Chiu, Sin-Wei; Parhi, Keshab K. (August 2025, IEEE Midwest Symposium on Circuits and Systems)

Polynomial modular multiplication is an important operation used in post-quantum cryptography and homomorphic encryption, which are based on ring learning with errors (RLWE) problems. For long polynomial lengths, this operation can be efficiently computed using number theoretic transform (NTT) and inverse NTT (INTT). In particular, negative wrapped convolution (NWC) has been proposed to compute this operation where zero padding is eliminated. Low-complexity structures for NTT (LC-NTT) and INTT (LC-INTT) have been derived in prior work by using a divide-and-conquer approach. This paper presents an alternate derivation of the LC-NTT and LC-INTT structures from traditional NTT and INTT structures. Specifically, we show that using twiddle factor pushing (pulling) from left to right (right to left), we can derive the prior LC-NTT (LC-INTT) structures. We present systematic algorithms for twiddle factor pushing and pulling to derive the equivalent architectures. The alternate approach may provide opportunities for optimizing hardware implementations of polynomial modular multiplication.
Free, publicly-accessible full text available August 10, 2026
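As background for the abstract above, the following is a minimal sketch of NWC-based polynomial modular multiplication using the conventional pre- and post-weighting form. It is not the LC-NTT/LC-INTT structure or the twiddle-shifting derivation from the paper. The toy parameters (N = 4, q = 17, psi = 2, a primitive 2N-th root of unity mod q) and all function names are assumptions chosen for illustration, and the transform is a quadratic-time definition rather than a butterfly network.

```python
def ntt(a, omega, q):
    """Naive O(N^2) number theoretic transform with root of unity omega mod q."""
    n = len(a)
    return [sum(a[j] * pow(omega, j * k, q) for j in range(n)) % q
            for k in range(n)]


def nwc_mul(a, b, psi, q):
    """Multiply a(x) * b(x) mod (x^N + 1, q) via weighted NTTs.
    psi must be a primitive 2N-th root of unity mod q (so psi^N = -1 mod q)."""
    n = len(a)
    omega = psi * psi % q
    psi_inv = pow(psi, -1, q)   # modular inverse, Python 3.8+
    n_inv = pow(n, -1, q)
    # Pre-weighting by psi^i turns the negative wrap into a circular wrap.
    aw = [a[i] * pow(psi, i, q) % q for i in range(n)]
    bw = [b[i] * pow(psi, i, q) % q for i in range(n)]
    # Pointwise product in the transform domain.
    cw_hat = [x * y % q for x, y in zip(ntt(aw, omega, q), ntt(bw, omega, q))]
    # Inverse transform (omega^-1 and 1/N), then post-weighting by psi^-i.
    cw = ntt(cw_hat, pow(omega, -1, q), q)
    return [cw[i] * n_inv * pow(psi_inv, i, q) % q for i in range(n)]


def schoolbook_nwc(a, b, q):
    """Reference: direct multiplication mod (x^N + 1, q)."""
    n = len(a)
    c = [0] * n
    for i in range(n):
        for j in range(n):
            k = i + j
            if k < n:
                c[k] = (c[k] + a[i] * b[j]) % q
            else:
                c[k - n] = (c[k - n] - a[i] * b[j]) % q
    return c
```

For example, `nwc_mul([1, 2, 3, 4], [5, 6, 7, 8], psi=2, q=17)` can be checked against `schoolbook_nwc` on the same inputs. No zero padding is needed because the weighting absorbs the sign flip of the wrapped terms.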
Architectural Tradeoffs for Long Polynomial Modular Multiplication

https://doi.org/10.1109/IEEECONF60004.2024.10942651

Chiu, Sin-Wei; Parhi, Keshab K. (October 2024, IEEE)

Polynomial multiplication over the quotient ring is a critical operation in Ring Learning with Errors (Ring-LWE) based cryptosystems that are used for post-quantum cryptography and homomorphic encryption. This operation can be efficiently implemented using number-theoretic transform (NTT)-based architectures. Among these, pipelined parallel NTT-based polynomial multipliers are attractive for cloud computing as these are well suited for high-throughput and low-latency applications. For a given polynomial length, a pipelined parallel NTT-based multiplier can be designed with varying degrees of parallelism, resulting in different tradeoffs. Higher parallelism reduces latency but increases area and power consumption, and vice versa. In this paper, we develop a predictive model based on synthesized results for pipelined parallel NTT-based polynomial multipliers and analyze design tradeoffs in terms of area, power, energy, area-time product, and area-energy product across parallelism levels up to 128. We predict that, for very long polynomials, area and power differences between designs with varying levels of parallelism become negligible. In contrast, area-time product and energy per polynomial multiplication decrease with increased parallelism. Our findings suggest that, given area and power constraints, the highest feasible level of parallelism optimizes latency, area-time product, and energy per polynomial multiplication.
Full Text Available
Long Polynomial Modular Multiplication Using Low-Complexity Number Theoretic Transform [Lecture Notes]

https://doi.org/10.1109/MSP.2024.3368239

Chiu, Sin-Wei; Parhi, Keshab K. (January 2024, IEEE Signal Processing Magazine)

This tutorial aims to establish connections among polynomial modular multiplication over a ring, circular convolution, and the discrete Fourier transform (DFT). The main goal is to extend the well-known theory of the DFT in signal processing (SP) to other applications involving polynomials in a ring, such as homomorphic encryption (HE).
Full Text Available
Low-Latency Preprocessing Architecture for Residue Number System via Flexible Barrett Reduction for Homomorphic Encryption

https://doi.org/10.1109/TCSII.2023.3344604

Chiu, Sin-Wei; Parhi, Keshab K. (December 2023, IEEE Transactions on Circuits and Systems II: Express Briefs)

Data privacy has become a significant concern due to the rapid development of cloud services, Internet of Things, edge devices, and other applications. Homomorphic encryption (HE) addresses the issue by enabling computations to be performed without the decryption of the encrypted message. However, the bottleneck of designing homomorphic encryption hardware is the complexity of computation. To tackle the long integer arithmetic, the residue number system based on the Chinese remainder theorem is used. In this paper, we propose a novel modular reduction architecture that computes the mapping of residual polynomials in parallel with high speed and low latency. We implement our proposed design on the Xilinx UltraScale+ FPGA board (VCU118). When the input sizes are 360-bit (1440-bit), the frequency is 180 MHz (168 MHz) with 4 pipelining stages. Also, the area-delay product (ADP) of DSP blocks of our design is reduced by 23 and 31 percent, respectively, for 360 and 1440 bits, compared to prior work.
Full Text Available
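For readers unfamiliar with the reduction named in the title above, here is a minimal sketch of textbook Barrett reduction. It is not the flexible variant or the parallel preprocessing architecture proposed in the paper. The function names are hypothetical, and the sketch assumes 0 <= x < 2^(2k), where k is the bit length of the modulus m.

```python
def barrett_setup(m):
    """Precompute k = bit length of m and mu = floor(2^(2k) / m)."""
    k = m.bit_length()
    mu = (1 << (2 * k)) // m
    return k, mu


def barrett_reduce(x, m, k, mu):
    """Return x mod m for 0 <= x < 2^(2k), using a multiply and shift
    in place of a division by m."""
    q_est = (x * mu) >> (2 * k)   # estimate of floor(x / m), never too large
    r = x - q_est * m
    while r >= m:                 # small correction for the underestimate
        r -= m
    return r
```

The division by m happens only once, in `barrett_setup`; every later reduction needs two multiplications, a shift, and a few subtractions, which is why the method suits hardware with a fixed modulus.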
PaReNTT: Low-Latency Parallel Residue Number System and NTT-Based Long Polynomial Modular Multiplication for Homomorphic Encryption

https://doi.org/10.1109/TIFS.2023.3338553

Tan, Weihang; Chiu, Sin-Wei; Wang, Antian; Lao, Yingjie; Parhi, Keshab K. (January 2023, IEEE Transactions on Information Forensics and Security)

High-speed long polynomial multiplication is important for applications in homomorphic encryption (HE) and lattice-based cryptosystems. This paper addresses low-latency hardware architectures for long polynomial modular multiplication using the number-theoretic transform (NTT) and inverse NTT (iNTT). Parallel NTT and iNTT architectures are proposed to reduce the number of clock cycles to process the polynomials. Chinese remainder theorem (CRT) is used to decompose the modulus into multiple smaller moduli. Our proposed architecture, namely PaReNTT, makes three novel contributions. First, cascaded parallel NTT and iNTT architectures are proposed such that any buffer requirement for permuting the product of the NTTs before it is input to the iNTT is eliminated. This is achieved by using different folding sets for the NTTs and iNTT. Second, a novel approach to expand the set of feasible special moduli is presented where the moduli can be expressed in terms of a few signed power-of-two terms. Third, novel architectures for pre-processing for computing residual polynomials using the CRT and post-processing for combining the residual polynomials are proposed. These architectures significantly reduce the area consumption of the pre-processing and post-processing steps. The proposed long modular polynomial multiplications are ideal for applications that require low latency and high sample rate such as in the cloud, as these feed-forward architectures can be pipelined at arbitrary levels. Pipelining and latency tradeoffs are also investigated. Compared to a prior design, the proposed architecture reduces latency by a factor of 49.2, and the area-time products (ATP) for the lookup table and DSP, ATP(LUT) and ATP(DSP), respectively, by 89.2% and 92.5%. 
Specifically, we show that for n = 4096 and a 180-bit coefficient, the proposed 2-parallel architecture requires 6.3 Watts of power while operating at 240 MHz, with 6 moduli, each of length 30 bits, using Xilinx Virtex UltraScale+ FPGA.
Full Text Available
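The CRT decomposition mentioned in the PaReNTT abstract above can be sketched as follows. The abstract describes six 30-bit moduli for a 180-bit coefficient; this sketch instead assumes three tiny pairwise-coprime moduli (13, 17, 19), chosen only so the arithmetic is easy to follow, and the function names are hypothetical. It shows the pre-processing step (mapping a value to residues) and the post-processing step (recombining residues), not the hardware architectures proposed in the paper.

```python
from math import prod


def to_rns(x, moduli):
    """Pre-processing: map an integer to its residues modulo each small modulus."""
    return [x % m for m in moduli]


def from_rns(residues, moduli):
    """Post-processing: recombine residues with the Chinese remainder theorem.
    The moduli must be pairwise coprime."""
    big_m = prod(moduli)
    x = 0
    for r, m in zip(residues, moduli):
        m_i = big_m // m                    # product of all the other moduli
        x += r * m_i * pow(m_i, -1, m)      # modular inverse, Python 3.8+
    return x % big_m


def rns_mul(a, b, moduli):
    """Multiply two integers mod prod(moduli) using independent small channels."""
    ra, rb = to_rns(a, moduli), to_rns(b, moduli)
    return from_rns([(x * y) % m for x, y, m in zip(ra, rb, moduli)], moduli)
```

Each residue channel works on short words and is independent of the others, which is what allows the per-modulus multiplications to run in parallel.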
